Journal of Medical Imaging — Latest Matching Preprints

1

Retrieval-Augmented Claude Opus 4.7 and GPT-5.5 Surpass Human Performance on the Nuclear Cardiology Board Preparation Exam (and Claude Drafts a Paper About it)

Killekar, A.; Shanbhag, A.; Miller, R. J.; Dey, D.; Bourque, J.; Phillips, L.; Chareonthaitawee, P.; Slomka, P.

2026-05-13 radiology and imaging 10.64898/2026.05.08.26352768 medRxiv

Top 0.1%

5.1%

Background: Previous studies evaluated large language model (LLM) performance on the American Society of Nuclear Cardiology (ASNC) Board Preparation Exam. Without domain-specific context, the best model (GPT-4o) achieved 63.1%, below the estimated 65% passing threshold and the 78% mean score of human fellows-in-training (FITs). Providing textbook context improved GPT-4o to 73.8% on text-only questions, but it still fell short of human trainees. Whether next-generation LLMs with retrieval-augmented generation (RAG) can close this gap is unknown. Methods: Claude Opus 4.7 and GPT-5.5 were administered all 168 questions (141 text-only, 27 image-based) from the 2023 ASNC Board Preparation Exam across 5 iterations each, using RAG with a nuclear cardiology textbook, companion atlas, and ASNC clinical guidelines. Claude used local FAISS-based semantic retrieval; GPT-5.5 used Azure's cloud-hosted vector store. Performance was compared to prior LLM results and 13 human FITs. Results: Across 5 iterations, Claude Opus 4.7 achieved a mean accuracy of 86.3% ± 1.4% (text 88.8%, image 73.3%). GPT-5.5 achieved 86.7% ± 2.2% (text 88.5%, image 77.0%) but refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. Both models surpassed the human FIT mean (78.0%) and the estimated passing threshold. Compared to GPT-4o without context (63.1%), this represents a 23-percentage-point improvement in 18 months. Conclusion: Next-generation LLMs with RAG now surpass average human trainee performance on nuclear cardiology board preparation questions, suggesting significant potential as educational tools and knowledge-reference aids in cardiovascular imaging. Condensed Abstract: Across 5 iterations each, Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation achieved mean accuracies of 86.3% and 86.7% on the 2023 ASNC Board Preparation Exam (168 questions), both surpassing the mean human fellow-in-training score of 78%. GPT-5.5 refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. These results represent a 23-percentage-point improvement over the best prior LLM without context (63.1%), demonstrating that RAG-enhanced LLMs have reached human-level proficiency in nuclear cardiology knowledge. Graphical Abstract: Overview of the three-study research arc evaluating LLM performance on the 2023 ASNC Board Preparation Exam. Study 1 (2024) tested four LLMs without context (best: GPT-4o, 63.1%). Study 2 (2025) added textbook context to GPT-4o (73.8%). Study 3 (2026, current) evaluated Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation across 5 iterations each (mean 86.3% and 86.7%, respectively), both surpassing the human fellow-in-training mean of 78%. Right panel shows the performance scale with key thresholds.
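
The derived figures in this abstract can be rechecked with simple arithmetic. The sketch below uses only numbers quoted in the abstract; the function names are illustrative and do not come from the preprint.

```python
# Recompute the derived figures quoted in the abstract from its own numbers.
# Function names are illustrative; they are not taken from the preprint.

def percentage_point_gain(new_pct: float, old_pct: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return new_pct - old_pct


def refusal_rate(mean_refused: float, total_questions: int) -> float:
    """Mean refused questions per iteration, as a percentage of the exam."""
    return 100.0 * mean_refused / total_questions


total_questions = 141 + 27  # text-only plus image-based questions
claude_gain = percentage_point_gain(86.3, 63.1)  # versus GPT-4o without context
gpt_refusals = refusal_rate(12.2, total_questions)

print(f"{total_questions} questions")
print(f"{claude_gain:.1f} percentage points gained")
print(f"{gpt_refusals:.1f}% refused per iteration")
```

The gain is a difference in percentage points rather than a relative change, which is how the abstract reports it.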

2

Scan length as a major driver of CT radiation dose: a diagnostic reference level audit from Kosovo

Rudi, G.; Vula, F.; Bicaku, A.; Dedushi, K.; Ahmetgjekaj, I.

2026-05-17 radiology and imaging 10.64898/2026.05.12.26353024 medRxiv

Top 0.1%

2.9%

Computed tomography is the largest contributor to population radiation dose from medical imaging, yet no diagnostic reference levels (DRLs) have been published from Kosovo or the Western Balkans. This retrospective audit analyzed all CT examinations performed on a 128-slice scanner at the University Clinical Centre of Kosovo between January and March 2026. After exclusions, 1,535 acquisitions from 1,092 patients across nine examination categories were analyzed. Local DRLs were defined as the 75th percentile and compared against German (BfS 2022) and Turkish (Kahraman et al., 2024) reference values. Head CT (n = 590) demonstrated CTDIvol 4.7% below the BfS DRL yet scan length 98.5% above the orientation value (median 25.8 vs 13 cm). Abdomen-pelvis CTDIvol matched the BfS reference while scan length exceeded it by 28%. Coronary CTA showed CTDIvol 377% above the reference, consistent with retrospective ECG gating. Excess scan length, not CTDIvol, is the major driver of elevated dose at this institution. The identified excesses are correctable through technologist landmarking training, protocol review, and enabling iterative reconstruction.
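
A minimal sketch of the two calculations this abstract relies on: a local DRL taken as the 75th percentile of a dose metric, and the percentage deviation from a reference value. The sample CTDIvol values are hypothetical, and the percentile convention (inclusive linear interpolation) is an assumption, since the abstract does not state which one was used.

```python
import statistics


def local_drl(values):
    """75th percentile of a dose metric (assumed inclusive interpolation)."""
    return statistics.quantiles(values, n=4, method="inclusive")[2]


def percent_deviation(observed, reference):
    """Signed deviation of an observed value from a reference, in percent."""
    return 100.0 * (observed - reference) / reference


# Hypothetical CTDIvol sample in mGy, for illustration only.
ctdi_vol_mgy = [38.0, 41.5, 44.0, 47.2, 52.1]
print(local_drl(ctdi_vol_mgy))

# Head CT scan length from the abstract: median 25.8 cm vs 13 cm orientation value.
head_excess = percent_deviation(25.8, 13.0)
print(round(head_excess, 1))
```

With the scan lengths quoted in the abstract, the deviation rounds to the reported 98.5%.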

3

Left Ventricular Volume and Function Assessment Using a Reduced-Slice Approach in Cardiovascular Magnetic Resonance

Tejaswi, A.; Fyrdahl, A.; Sigfridsson, A.

2026-06-01 cardiovascular medicine 10.64898/2026.05.29.26354413 medRxiv

Top 0.1%

2.7%

Background: Cardiovascular magnetic resonance (CMR) quantification of the left ventricular (LV) volumes and ejection fraction (EF) typically involves manual segmentation of many short axis (SAx) and long axis (LAx) slices of the left ventricle. The scan time and the number of breath holds are proportional to the number of slices. We aimed to evaluate a geometric model of the left ventricle that could enable planimetry from a reduced number of slices. We sought to determine whether acceptable accuracy was retained for evaluating the End Diastolic Volume (EDV), End Systolic Volume (ESV), Stroke Volume (SV), and EF to provide a rapid and reliable clinical alternative. Methods: A cohort of 342 patients, median age: 54 (40 - 65) years, with full-stack CMR examinations was used. Nine geometrical combinations were evaluated: 3, 4 or 5 short axis slices and one of three LAx orientations (2-chamber, 3-chamber or 4-chamber) by retrospectively decimating the full-stack acquisition. LV volumes were calculated as a sum of trapezoidal approximations for apical and mid-cavity slices and a generalized prismoidal model at the base. The accuracy of the volume calculations was quantified against the full-stack reference for the EDV, ESV, SV, and EF using concordance correlation coefficient (CCC), two-way repeated measures ANOVA, pairwise tests, and Bayes factor log10(BF10) analysis. Results: The choice of the long axis (LAx) view was the most influential driver of accuracy (g2 = 0.104, for EDV), approximately 50 times more impactful than the number of SAx slices (g2 = 0.002, for EDV). Volumes calculated using the combination of 2-chamber LAx view and 5 SAx slices had the highest concordance with the full stack (CCC>0.90). While the estimated absolute volumes displayed a systematic negative bias, EF and SV remained highly robust due to bias cancellation. For a 2ch + 5 SAx protocol, EF bias was just 0.83% (LoA: -6.18 to 7.84%), with a minimum detectable change (MDC) of 7.01%, compared to 8.7% reported for expert human readers, suggesting strong concordance. Bayesian paired-samples t-tests yielded log10(BF10) = 6.42 in favor of 5 SAx over 3 SAx, constituting decisive evidence on the Jeffreys scale. The bias and limits of agreement (LoA) for stroke volume and ejection fraction were found to be lower than scan-rescan reproducibility in the literature. Conclusion: This reduced-slice geometric model allows for a reduced number of breath holds compared to a conventional full-stack CMR acquisition and provides acceptable accuracy with bias less than scan-rescan variability.
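
The trapezoidal part of the volume model, and the reason EF can stay stable when absolute volumes are biased, can be illustrated with a short sketch. It makes several assumptions: slice areas and positions are hypothetical, only the trapezoidal sum is shown (the generalized prismoidal base model and the LAx geometry are not reproduced), and the bias is modeled as a purely proportional underestimate.

```python
def trapezoidal_volume(areas_cm2, positions_cm):
    """Sum trapezoidal slabs between adjacent slices; cm^3 equals mL."""
    if len(areas_cm2) != len(positions_cm):
        raise ValueError("areas and positions must have the same length")
    volume = 0.0
    for i in range(len(areas_cm2) - 1):
        gap = abs(positions_cm[i + 1] - positions_cm[i])
        volume += 0.5 * (areas_cm2[i] + areas_cm2[i + 1]) * gap
    return volume


def ejection_fraction(edv_ml, esv_ml):
    """EF in percent from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Hypothetical slice positions (cm) and cavity areas (cm^2), apex to base.
positions = [0.0, 2.0, 4.0, 6.0, 8.0]
ed_areas = [2.0, 12.0, 20.0, 24.0, 26.0]
es_areas = [1.0, 5.0, 9.0, 11.0, 12.0]

edv = trapezoidal_volume(ed_areas, positions)
esv = trapezoidal_volume(es_areas, positions)
ef = ejection_fraction(edv, esv)

# Assumed proportional bias: both volumes underestimated by 10%.
ef_biased = ejection_fraction(0.9 * edv, 0.9 * esv)
print(edv, esv, ef, ef_biased)
```

Under this assumption the scale factor cancels in the EF ratio, so EF is unchanged while EDV and ESV both shrink. A constant additive offset would instead cancel in SV, which is a difference of the two volumes.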

4

TopBrain Segmentation Challenge for Whole Brain Vessel Anatomy

Yang, K.; Shi, P.; Huang, H.; Musio, F.; Baazaoui, H.; Aydin, O. U.; Hilbert, A.; Hamadache, R. E.; Yalcin, C.; Zhang, M.; Falcetta, D.; de la Rosa, E.; Shit, S.; Prabhakar, C.; Wittmann, B.; Rokuss, M. R.; Kirchhoff, Y.; Al-Maskari, R.; Hoeher, L.; Juchler, N.; Casamitjana, A.; Cleary, J.; Schmick, A.; Baumgartner, P.; Deseoe, J.; Vandans, O.; Lee, D.; Oh, K.; LaBella, D.; Mazher, M.; Niederer, S. A.; Qayyum, A.; Liu, Y.; Chen, J.; Kim, W.; Asawalertsak, N.; Kim, M.; Shin, D.; Park, S.-H.; Kikuchi, S.; Zhang, Y.; Liu, J.; Cui, Y.; Qiu, Y.; Verschuur, A.; Zhang, J.; van der Schaaf, I.; Su, R.;

2026-05-30 radiology and imaging 10.64898/2026.05.28.26354312 medRxiv

Top 0.1%

1.8%

We present the TopBrain 2025 Challenge, the first benchmark for fine-grained multiclass segmentation of the whole brain vasculature in both computed tomography angiography (CTA) and magnetic resonance angiography (MRA). Building on the TopCoW challenge, TopBrain scales vessel annotation from the Circle of Willis to the entire brain, introducing a dataset of 90 annotated volumes across 48 landmark vessel classes spanning arterial and venous systems, of which 50 training volumes are publicly released. Vessel definitions were consolidated from established neuroanatomical references into a unified annotation scheme, and vessel caliber measurements along the centerline are reported for the first time across the whole brain vascular anatomy. To address the unique challenges of multiclass brain vessel segmentation, we propose an evaluation framework that accounts for detection in segmentation performance, assesses anatomical plausibility, and introduces novel contamination metrics that characterize inter-class prediction errors. Fifteen teams from over 220 registered participants submitted algorithms to the benchmark. The top-performing teams built on nnUNet with principled system design choices, achieving around 80% Dice scores, near-zero invalid neighbor counts, over 60% F1 scores for side-road vessels, and below 18% foreground contamination ratio. Larger vessels are easier to segment, while smaller and more complex vessels remain the true bottleneck. The annotated datasets and podium-finish algorithms are made publicly available on Zenodo.
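
For readers unfamiliar with the Dice scores reported here, the sketch below computes a per-class Dice coefficient on flattened label maps. The label values and toy arrays are hypothetical, and the challenge's detection-aware evaluation, anatomical plausibility checks, and contamination metrics are not reproduced.

```python
def dice_per_class(reference, prediction, labels):
    """Per-class Dice, 2|A∩B| / (|A| + |B|), on flattened label maps."""
    if len(reference) != len(prediction):
        raise ValueError("label maps must have the same number of voxels")
    scores = {}
    for label in labels:
        ref = {i for i, v in enumerate(reference) if v == label}
        pred = {i for i, v in enumerate(prediction) if v == label}
        denom = len(ref) + len(pred)
        if denom == 0:
            continue  # class absent from both maps: Dice is undefined, so skip it
        scores[label] = 2 * len(ref & pred) / denom
    return scores


# Toy flattened volumes: 0 = background, 1 and 2 = hypothetical vessel classes.
reference = [0, 1, 1, 1, 2, 2, 0, 0]
prediction = [0, 1, 1, 2, 2, 2, 0, 1]
scores = dice_per_class(reference, prediction, labels=[1, 2])
print(scores)
```

Skipping classes that are absent from both maps is one possible convention; a benchmark may handle missing classes differently.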

5

Vascular Deformation Mapping Calibration with Physics-based Synthetic Data on Multi-axial Aortic Motion

Kim, T.; Baker, T.; Burris, N.; Figueroa, A.

2026-05-22 bioengineering 10.64898/2026.05.20.726669 medRxiv

Top 0.1%

1.8%

Aortic stiffness is both heterogeneous and anisotropic. Current non-invasive methods to estimate aortic stiffness are limited to characterizing the aortic tissue as isotropic due to the lack of techniques required to extract multi-axial strain from 3D dynamic images. Vascular deformation mapping (VDM) is a nonrigid image registration technique which has thus far been applied to map aortic growth using longitudinal imaging. In this study, we propose to use VDM to assess 3D aortic deformation by mapping diastolic and systolic images. During the image registration process, penalty parameters are employed to fine-tune image alignment and penalize non-physiological deformations. These penalty parameters must be calibrated to ensure that VDM successfully reproduces multi-axial aortic motion patterns in health and disease. In this paper, we developed a calibration pipeline for these parameters using synthetic data. A rotation-free shell model was used to generate physics-based synthetic data on aortic motion incorporating patient-specific geometries, root motion, and blood pressure from a cohort of 14 subjects (healthy, Marfan syndrome, and thoracic aortic aneurysm). An error metric was defined to quantify the quality of the VDM results. Furthermore, a k-means clustering technique was used to categorize the subjects into three clusters based on ascending aortic motion. Optimal penalty parameters were identified for each of the three clusters. The results indicated that patient clusters with smaller aortic root motion required larger rigidity penalty values. The calibrated parameters successfully reduced errors in 3D displacement and multi-axial stretch compared to un-optimized VDM predictions, enhancing the accuracy of capturing aortic deformation from dynamic images. Among the different aortic regions, the ascending thoracic aorta exhibited the largest error reduction.

6

AI-Based Coronary Artery Calcification on Non-contrast CT: Performance Across Calcium Scoring, Lung Cancer Screening, and Liver Transplant Candidate Cohorts

Ludwig, K. D.; Hatt, C. R.; Keith, L.; Matyga, A. W.; Te, H. S.; Landeras, L.; Chelala, L.; Patel, A. R.; Chung, J. H.

2026-05-15 radiology and imaging 10.64898/2026.05.12.26352904 medRxiv

Top 0.1%

1.8%

Objective: Coronary artery calcification (CAC) assessment for cardiovascular risk stratification is traditionally achieved using ECG-gated computed tomography (CT). Automated deep-learning (DL) algorithms may streamline opportunistic CAC detection and scoring, particularly on non-gated CT scans. This study evaluated the performance of a fully automated DL-based CAC scoring algorithm ("DL-CAC") against expert human scoring. Methods: The algorithm was trained on 1,260 chest CT scans from multiple databases to automatically identify coronary calcium, calculate Agatston scores, and assign a cardiovascular disease (CVD) risk classification. Performance was assessed on a holdout dataset (n=500) comprising ECG-gated calcium scoring CT scans and lung cancer screening non-gated chest CTs as well as in an external, independent CT dataset (n=129) from liver transplant candidates. Agreement with expert scoring was assessed using intraclass correlation coefficient (ICC) for Agatston scores and Cohen's κ for CVD risk classification. Results: The algorithm demonstrated high agreement with expert scoring in the pooled calcium scoring and lung cancer screening cohorts, with an ICC of 0.947 for Agatston scores and κ of 0.936 for CVD risk classification. For liver transplant candidates, the algorithm exhibited substantial agreement with expert scoring of non-gated CT scans (κ=0.79) and a sensitivity of 90.4% and specificity of 96.4% in high-risk cases. Conclusion: These findings suggest that DL-based CAC scoring on non-gated CT scans may be a feasible alternative to traditional methods and could support opportunistic cardiovascular risk assessment in routine imaging. Further validation is warranted to assess clinical integration in broader practice settings.
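
As a reminder of how the agreement statistic above is defined, here is a minimal sketch of unweighted Cohen's kappa for two raters. The risk labels are hypothetical, and the abstract does not say whether a weighted variant was used, so the unweighted form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical risk categories assigned by an expert and by a model.
expert = ["low", "low", "high", "high"]
model = ["low", "high", "high", "high"]
kappa = cohens_kappa(expert, model)
print(kappa)
```

In this toy case the raters agree on 3 of 4 cases (0.75) against a chance agreement of 0.5, giving a kappa of 0.5.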

7

Deep Learning for Automated Meningioma Segmentation: Toward Clinical Integration and Workflow Efficiency

Fenney, E.; Muralidharan, L.; Ruffle, J. K.; Pandit, A.; Millip, M.; Hammam, A.; Brookes, T.; Jabeen, F.; Colman, J.; Sarwani, O.; Alattar, K.; Efthymiou, E.; Kallam, N.; Siddiqui, J.; Marcus, H. J.; Nachev, P.; Hyare, H.

2026-05-15 neurology 10.64898/2026.05.12.26352585 medRxiv

Top 0.1%

1.7%

Background: Meningiomas are the most common primary intracranial tumors in adults, and volumetric assessment increasingly guides surveillance and treatment decisions. Automated segmentation could enable standardized volumetry but requires robust validation. Purpose: To develop a fully automated three-dimensional deep learning model for meningioma segmentation on multiparametric MRI, and to evaluate segmentation accuracy, external generalizability, failure modes, radiologist-rated clinical plausibility, and workflow feasibility. Methods: From 2024 to 2026, this retrospective study trained a custom 3D nnU-Net residual encoder model. Expert segmentations covered enhancing tumor (ET), tumor core (TC), and whole tumor (WT). Dice similarity coefficient (DSC) was the primary metric. External validation used an independent single-institution dataset (n = 310 intracranial cases) with incomplete MRI protocols. Failure modes, model equity, and inference time were assessed. A blinded multi-rater study (10 radiologists; 510 cases) rated TC segmentations using a 0-10 Likert scale, analyzed with linear mixed-effects models. Results: Model training used the BraTS Meningioma 2023 dataset (n = 1000; mean age 60.2 ± 14.5; 705 female). In cross-validation, mean DSC was 0.939 for ET, 0.937 for TC, and 0.921 for WT. In external validation, mean DSC was 0.872 for TC and 0.842 for WT, despite heterogeneous protocols and incomplete sequences. Predicted TC volumes correlated strongly with reference volumes in cross-validation (r = 0.995) and external validation (r = 0.971). The most common failure modes involved skull base and intraosseous tumors, and performance was equitable across demographic subgroups. Mean inference time was 1.2 seconds. In blinded evaluation (1120 ratings), model segmentations received higher scores than reference annotations (+0.32 BraTS; +1.38 external validation).
Conclusion: A fully automated deep-learning model achieved high meningioma segmentation accuracy across multi-institutional training data and external clinical imaging. In a blinded study, model segmentation quality exceeded reference annotations, and 1.2-second inference supported workflow integration. Prospective evaluation is warranted before routine deployment.

8

Age-related nonlinear trajectories of abdominal organ volumes on CT: a longitudinal study

Nomura, Y.; Hanaoka, S.; Nakao, T.; Yamagishi, Y.; Kikuchi, T.; Sonoda, Y.; Miki, S.; Oba, K.; Yoshikawa, T.; Abe, O.

2026-05-08 radiology and imaging 10.64898/2026.05.06.26352299 medRxiv

Top 0.1%

1.7%

Objectives: To characterize longitudinal age-related changes in abdominal organ volumes using CT volumetry and to model nonlinear trajectories across multiple organs. Materials & Methods: This retrospective single-center study included adults who underwent whole-body screening low-dose CT between 2006 and 2017. Subjects with at least eight examinations during a follow-up period of at least 78 months were included. After applying exclusion criteria, 700 participants with 6,739 CT series were analyzed. Non-contrast CT images were processed using automated organ segmentation, and volumes of the liver, pancreas, spleen, and kidneys were quantified. Longitudinal changes were modeled using generalized additive mixed models with sex-specific smooth functions of age and subject-level random effects. Age-dependent rates of change were estimated from model derivatives. Results: A total of 700 participants (mean age, 56.9 ± 9.8 years; 29.6% women) were evaluated. Liver, pancreas, and kidney volumes showed mild increases or plateaued at approximately 40-60 years of age, depending on the organ, and were followed by gradual declines with advancing age, whereas splenic volume showed a progressive decrease across the age range. These patterns showed nonlinear age dependence. The transition from positive to negative change rates tended to occur earlier in women than in men for several organs, particularly the liver and kidneys. Conclusion: Longitudinal CT analysis demonstrated nonlinear age-related changes in abdominal organ volumes, with organ-specific trajectories and sex-related differences in the timing and magnitude of volume changes. Question: How do abdominal organ volumes change longitudinally with age, and can their trajectories be characterized for each organ? Findings: Longitudinal CT analysis demonstrated nonlinear, organ-specific volume trajectories, with transitions from stability to decline around 40-60 years and earlier transitions in women than men. Clinical Relevance: Longitudinal reference patterns of abdominal organ volumes on CT improve the interpretation of age-related changes and support more accurate differentiation between physiological variation and disease-related volume alterations.

9

Failure detection in medical image classification under realistic distribution shifts: A large-scale benchmark

Steinmetz, P.; Frouin, F.; Morard, V.; Buvat, I.

2026-05-05 radiology and imaging 10.64898/2026.05.04.26350496 medRxiv

Top 0.1%

1.7%

Medical images (MI) exhibit variability due to different acquisition protocols, devices, and patient populations, making failure detection at inference time essential for reliable deployment of clinical classifiers. As existing evaluations of failure detection methods use different settings, it is difficult to compare results and identify the best strategy, if any. We present a comprehensive benchmark of eight confidence scoring functions and two score-aggregation strategies across eight MI tasks spanning diverse modalities, backbone architectures, training setups, and failure sources. The confidence ranking ability and classification error mitigation are jointly evaluated. While no single method systematically dominated across settings, aggregation of confidence scores consistently matched or approached the best individual method and substantially reduced silent failure rate. The failure detection performance was strongly correlated with classifier accuracy for all tested settings. These findings provide large-scale evidence regarding the strengths and limitations of confidence scoring strategies and offer actionable guidance for mitigating silent failures under realistic distribution shifts in MI.

10

An Automated CT-derived Marker of Renal Tumor Complexity: The CLARITY Score

Jonnalagadda, R.; Patel, S. H.; Abusafieh, H. T.; Seshadri, R.; Jevnikar, D.; Younis, S.; Al-Bayati, A.; Saputro, N.; Knorr, J.; Wang, B.; Ozery-Flato, M.; Rosen-Zvi, M.; Abouassaly, R.; Remer, E.; Heller, N.; Weight, C.

2026-05-12 urology 10.64898/2026.05.08.26352647 medRxiv

Top 0.2%

1.0%

Background and Objective: Surgical complexity for renal tumors has traditionally been assessed using manual nephrometry scores, which require unreimbursed physician effort and are subject to interobserver variability. This study introduces an objective, fully automated alternative derived from decades of experience at a large academic center. Methods: We trained a CT classification model to predict whether a patient would ultimately undergo Partial or Radical Nephrectomy (PN or RN). We hypothesized that the model's confidence in RN (termed the CLARITY score) would serve as a surrogate for the difficulty of nephron-sparing approaches and thus for tumor complexity. This hypothesis was tested using multivariate logistic regression for failure to achieve trifecta, estimated blood loss (EBL) ≥ 500 mL, and length of stay ≥ 3 d. CLARITY was compared with tumor size and R.E.N.A.L. score. External validation in a geographically distinct cohort was performed. Key Findings and Limitations: For predicting RN, CLARITY achieved an AUROC of 0.899 internally and 0.898 externally. In the external PN subgroup, it outperformed tumor size and R.E.N.A.L. score in predicting failure to achieve trifecta (AUROC 0.613), EBL ≥ 500 mL (0.727), and length of stay ≥ 3 d (0.673). In multivariable analysis, CLARITY remained associated with each outcome, whereas R.E.N.A.L. and size were not. This study is limited by its retrospective design. Conclusions and Clinical Implications: CLARITY is an automated CT-derived marker that quantifies renal tumor complexity more effectively than tumor size and R.E.N.A.L. score and may support scalable, objective preoperative complexity assessment. To support reproducibility and external validation, we have released a public inference pipeline and web-based DICOM upload portal for research use.

11

DISCERN: A Clinical Impact-aware Framework for Radiology Report Comparison

Sharma, R.; Beeche, C.; Dong, J.; Zhuang, R.; Qu, H.; Zhang, R.; Gangaram, V.; Goswami, P.; Xin, J.; Ballard, J.; Goldberg, A.; Sagreiya, H.; Long, Q.; Chen, T.; Witschey, W. R.

2026-05-27 radiology and imaging 10.64898/2026.05.26.26353612 medRxiv

Top 0.2%

1.0%

The surge in medical imaging has spurred the development of vision-language models (VLMs) to alleviate radiologist workloads. However, clinical deployment is hindered by the lack of meaningful evaluation frameworks. Current metrics, ranging from semantic similarity to large language model (LLM)-based judges, often fail to distinguish between clinically trivial and critical discrepancies, poorly reflecting real-world clinical judgment. To address this, we introduce DISCERN (Discordance and Significance-aware Entity-level Radiology Report Comparison). DISCERN is a significance-aware framework that weighs report errors based on their potential impact on patient care. Our results demonstrate that DISCERN powered by closed-source LLMs aligns more closely with expert radiologist assessments than traditional metrics or current LLM evaluators, providing a more interpretable and clinically relevant benchmark. By modeling radiologist prioritization and entity-level feedback, DISCERN facilitates targeted model refinement and supports the safer integration of generative AI into clinical workflows.

12

Unsupervised Tissue Concepts for Explainable Sarcoma Subtype Prediction from H&E

Bisson, T.; Ingram, D.; Singh, S.; Li, A.; Flynn, S.; Wang, W.-L.; Kim, A. E.; Bridge, C. P.; Demicco, E. G.; Sorrentino, A.; Jiang, S.; Hung, Y. P.; Lazar, A. J.; Iafrate, A. J.

2026-05-20 pathology 10.64898/2026.05.15.26353333 medRxiv

Top 0.2%

0.9%

Soft tissue sarcomas are a rare, heterogeneous group of tumors whose diagnosis remains challenging because of overlapping morphology and limited access to sarcoma-specialized pathologists. Although pathology foundation models have shown promise in computational pathology, their clinical translation remains limited by insufficient interpretability, particularly in diagnostically complex settings such as sarcoma diagnosis. Here, we developed and evaluated an H&E-based AI framework for sarcoma subtype classification that focused on explainability. Using the CONCH v1.5 foundation model, we computed embeddings from a tissue microarray cohort of 2,545 cases spanning 19 sarcoma subtypes and trained an attention-based multiple-instance learning (ABMIL) model that achieved a balanced accuracy of 77.38% (SD 1.88). To move explainability beyond attention-based localization, we trained a sparse autoencoder on patch-level embeddings to learn 768 recurring visual concepts. Ninety high-activation concepts were reviewed by three senior pathologists and curated into morphologically meaningful and non-meaningful categories, yielding a semantic dictionary of 41 diagnostically relevant tissue concepts. We then trained a linear attention-based model on the 768-concept vectors, which retained much of the performance of the raw embedding-based ABMIL model, achieving a balanced accuracy of 73.74% (SD 1.30). When restricting the linear model to pathologist-curated morphologic concepts only, balanced accuracy further decreased to 67.04% (SD 1.27), suggesting that the residual performance gain in the full concept model was driven by inconsistent, technical, or diagnostically irrelevant concepts. Concept-level explanations of the curated linear attention-based model aligned with known sarcoma morphology, including lipogenic, myxoid, spindle-cell, pleomorphic, vascular, small round blue cell, and matrix-forming patterns, and reproduced patterns of diagnostic overlap observed in human sarcoma pathology.
Together, these results show that H&E-based foundation-model representations capture meaningful diagnostic structure within the known limitations of H&E in sarcoma diagnostics, but that their clinical value depends on whether this structure can be made interpretable to pathologists. Sparse autoencoder-derived concepts can address this critical gap by converting embedding-level signal into recurring morphologic patterns that pathologists can review and name, providing the foundation to link these patterns to subtype predictions. In doing so, this approach turns concept discovery into a practical form of diagnostic explanation, while also revealing where model performance is supported by recognizable histopathology and where it relies on diagnostically irrelevant or inconsistent visual patterns.

13

Opportunistic CT Attenuation Biomarkers of Anemia Are Associated With Impaired Myocardial Flow Reserve and Cardiovascular Outcomes

Miller, R. J.; Shanbhag, A.; Yi, J.; Kwiecinski, J.; Kavanagh, P.; Ramirez, G.; Lemley, M.; Kamagate, A.; Slipczuk, L.; Travin, M. I.; Alexanderson, E.; Carvajal-Juarez, I.; Packard, R. R. S.; Al-Mallah, M.; Einstein, A. J.; Acampa, W.; Knight, S.; Le, V. T.; Mason, S.; Wopperer, S.; Chareonthaitawee, P.; Rosamond, T. L.; DeKemp, R. A.; Buechel, R. R.; Berman, D. S.; Dey, D.; Di Carli, M. F.; Slomka, P.

2026-05-19 radiology and imaging 10.64898/2026.05.14.26353239 medRxiv

Top 0.2%

0.9%

Background: Anemia is an established marker of cardiovascular disease severity and risk which leads to elevations in resting myocardial blood flow (MBF) and impaired myocardial flow reserve (MFR) in patients without obstructive coronary artery disease (CAD). Anemia can potentially be detected opportunistically from blood pool density changes on computed tomography (CT) imaging. Objectives: We evaluated relationships between chamber density measurements and hemoglobin, positron emission tomography (PET) findings, and cardiovascular events. Methods: We included 33,460 patients from 13 sites in REFINE-PET who underwent PET and 24,368 patients undergoing lung cancer screening chest CT. A deep learning model segmented cardiac chambers from CT images, then quantified chamber density. We evaluated the relationship between chamber density measures and resting MBF and MFR, as well as associations with death or myocardial infarction (MI). Results: We included a total of 57,828 patients. A higher density in myocardium compared to left ventricle blood pool was associated with reduced MFR (adjusted odds ratio 3.02 per SD increase, 95% confidence interval [CI] 2.72-3.38) and an increased risk of death or MI (adjusted hazard ratio [HR] 1.38 per SD increase, 95% CI 1.26-1.51). Having myocardial density higher than blood pool density was also associated with cardiovascular death in patients undergoing low-dose chest CT (adjusted HR 1.73, 95% CI 1.20-2.52). Conclusions: In a large multimodality dataset, lower cardiac chamber density is associated with impaired MFR and independently associated with cardiovascular events. These biomarkers can be automatically extracted from CT to provide physiologic insights and potentially guide patient care.

14

Bridging Cotyledon Pathology and Perfusion in Healthy Primate Pregnancy

Keding, L. T.; Liu, R.-Y.; Keding, T. J.; Vazquez, J.; Bockoven, C. G.; Shah, D. M.; Golos, T. G.; Wieben, O.; Stanic, A. K.

2026-05-21 pathology 10.64898/2026.05.18.726079 medRxiv

Top 0.2%

0.9%

Introduction: Healthy and diseased placentae alike often display some degree of pathology. However, quantitative techniques to characterize common pathologies and their relationship to local maternal hemodynamics in healthy primate placentae are currently limited. Methods: Placentae from seven rhesus macaques were imaged by MRI at three time points across mid- to late gestation, to quantify placental blood volume, flow, and perfusion from maternal spiral arteries across pregnancy. Near term, we collected placental cotyledons, digitized hematoxylin/eosin-stained slides, then segmented and annotated sub-tissues and major pathologies (intervillous gaps, fibrin deposition, villous agglutination, inflammatory agglutination, and stromal mineralization) within each cotyledon. Individual pathologies were assessed in relation to each other and MRI perfusion metrics, in a cotyledon-specific manner. Parallel analyses were performed to investigate both basic (Spearman correlation) and animal variance-negated (dimensionality-reduction) relationships. Results: Cotyledons with increased stromal mineralization demonstrated low blood perfusion across pregnancy, alongside significant compensatory changes. Mineralization was further associated with decreased fetal weight, across all sub-tissues. Dimensionality reduction revealed maternal vascular malperfusion-associated pathologies as the largest contributor to dataset variance. Additionally, pathologies commonly associated with healthy placental function demonstrated low cotyledon blood flow and volume at all timepoints, with no evidence of compensatory changes across gestation. Conclusions: Comprehensive digital annotation revealed several relationships connecting pathology and maternal blood perfusion in healthy primate pregnancy, at the smallest functional unit of the placenta.
This methodological framework embeds pathologist-refined morphological expertise into a quantitative, spatially resolved format that can ground, rather than be replaced by, unsupervised computational approaches to placental analysis.

15

PIE Toolbox: SSM-PCA Based Software for PET Diagnostic Pattern Analysis

Romanov, M.; Kireev, M.; Didur, M.; Cherednichenko, D.; Korotkov, A.; Valdes-Sosa, P.; Fan, Q.; Wang, Q.

2026-06-01 radiology and imaging 10.64898/2026.05.28.26354341 medRxiv

Top 0.3%

0.9%

One of the prominent methods in neuroimaging data processing is SSM-PCA, which is based on principal component analysis and allows for the identification of diagnostically significant patterns in the form of statistical maps. We developed software, PIE Toolbox, that employs SSM-PCA and classification based on the diagnostic patterns obtained from functional and structural tomographic brain imaging. The program supports the entire analysis pipeline, including preprocessing of brain images, diagnostic pattern extraction, building classification models, and prediction based on them. The resulting diagnostic patterns are weighted principal components obtained through SSM-PCA, or their linear combinations. PIE Toolbox allows selection of relevant structural and functional brain patterns, computation of their expression values in regions of interest, classification using support vector machines, and evaluation of model performance via cross-validation. This approach enables the use of patterns as features of intergroup differences for individual diagnosis. The software has been validated on both simulated and ADNI datasets.
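
The core SSM-PCA step described above can be sketched roughly as follows. This is a minimal illustration using NumPy on synthetic data, not the PIE Toolbox implementation: the function name `ssm_pca`, the array layout (subjects by voxels), and the simulated pattern are all assumptions made for the example, and the support vector machine and cross-validation stages are left out.

```python
import numpy as np


def ssm_pca(images, n_components=2):
    """Sketch of SSM-PCA on a (n_subjects, n_voxels) array of positive values.

    Hypothetical helper for illustration; not taken from PIE Toolbox.
    """
    images = np.asarray(images, dtype=float)
    if np.any(images <= 0):
        raise ValueError("values must be positive for the log transform")

    log_data = np.log(images)
    # Remove each subject's own mean over voxels (global scaling effect).
    row_centered = log_data - log_data.mean(axis=1, keepdims=True)
    # Remove the group mean profile shared by all subjects.
    group_mean_profile = row_centered.mean(axis=0, keepdims=True)
    residuals = row_centered - group_mean_profile

    # PCA via SVD: rows of vt are orthonormal voxel-space patterns.
    _, singular_values, vt = np.linalg.svd(residuals, full_matrices=False)
    patterns = vt[:n_components]
    # Expression of each pattern in each subject.
    scores = residuals @ patterns.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return patterns, scores, explained[:n_components]


# Synthetic data (assumption): 10 controls, 10 patients, 50 voxels.
rng = np.random.default_rng(0)
n_controls, n_patients, n_voxels = 10, 10, 50
n_subjects = n_controls + n_patients

baseline = rng.normal(2.0, 0.1, n_voxels)
# Zero-mean spatial pattern that is expressed only in patients.
true_pattern = np.sin(np.linspace(0.0, 2.0 * np.pi, n_voxels, endpoint=False))
group = np.r_[np.zeros(n_controls), np.ones(n_patients)]
global_scale = rng.normal(0.0, 0.2, (n_subjects, 1))
noise = rng.normal(0.0, 0.01, (n_subjects, n_voxels))

log_images = baseline + global_scale + 0.3 * group[:, None] * true_pattern + noise
images = np.exp(log_images)

patterns, scores, explained = ssm_pca(images)
print("variance explained by first pattern:", round(float(explained[0]), 3))
```

In this toy setting the first pattern's scores separate the two simulated groups, which is the property that makes such scores usable as classifier features.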

16

Economic costing of evaluating, deploying and monitoring an artificial intelligence-based reconstruction for acceleration of rectal MRI examinations

Harrison, C. A.; Wu, M.; White, O.; Hopkinson, G.; Hughes, J.; Robertson, S.; Scurr, E.; Shur, J.; Castagnoli, F.; Charles-Edwards, G.; Koh, D.-M.; Winfield, J.

2026-05-21 radiology and imaging 10.64898/2026.05.18.26353474 medRxiv

Top 0.3%

0.8%

Objectives: AI-based reconstructions can reduce MRI acquisition times and/or improve image quality. Guidelines recommend clinical evaluations and post-deployment monitoring of these novel methods; however, there has been little investigation of the clinical resources required for such assessments. The aim of this study was to evaluate the healthcare resource utilisation and potential savings associated with AI-based reconstructions in rectal MRI. Methods: A retrospective economic costing analysis was conducted from the NHS healthcare perspective. Resource utilisation data were extracted from the Electronic Patient Records for 9 healthy volunteer scans and 104 rectal MRI examinations evaluating an AI-based reconstruction. The resource profile included the MRI scan and the staff time required for data acquisition and analysis. Results: The clinical evaluation of the AI-based reconstruction cost £15,023. Deployment of the AI-based reconstruction reduced the length of a rectal MRI scan by 22 minutes, theoretically saving approximately £3,437 per month. Addition of post-deployment quality control scans reduced this monthly saving to £2,636. If the quality control scans were evaluated by radiologists rather than with image quality metrics, monthly savings would be approximately £2,541. With ongoing quality control, the clinical evaluation cost would be recouped in between 5.8 and 6 months, compared with 4.4 months without ongoing quality control. Conclusions: Deploying AI-based reconstructions can yield cost savings through reduced scanning times. Quality control tests using image quality metrics would reduce radiologist workload and costs compared with repeated image scoring by radiologists.
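
The payback periods quoted above follow from dividing the one-off evaluation cost by the monthly saving. The short sketch below reproduces that arithmetic from the rounded figures in the abstract; the function name `payback_months` and the scenario labels are assumptions for illustration, and the study's own costing model may differ in detail. Using the rounded inputs gives about 4.4 months without ongoing quality control and a little under 6 months with it, so small differences from the reported range may simply reflect rounding.

```python
# Rounded figures taken from the abstract (GBP).
EVALUATION_COST = 15_023

MONTHLY_SAVINGS = {
    "no ongoing QC": 3_437,
    "QC with image quality metrics": 2_636,
    "QC scored by radiologists": 2_541,
}


def payback_months(upfront_cost, monthly_saving):
    """Months needed for cumulative savings to cover a one-off cost."""
    if monthly_saving <= 0:
        raise ValueError("monthly saving must be positive")
    return upfront_cost / monthly_saving


payback = {
    scenario: payback_months(EVALUATION_COST, saving)
    for scenario, saving in MONTHLY_SAVINGS.items()
}

for scenario, months in payback.items():
    print(f"{scenario}: {months:.1f} months")
```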

17

An Experimental Investigation of the Relationship between AI-Human Workflow Design and Legal Liability for Radiologists: The Erroneous-Change Penalty and Omission Bias

Song, E. C.; Bernstein, M. H.; Sheppard, B.; Bruno, M. A.; Baird, G. L.

2026-05-22 radiology and imaging 10.64898/2026.05.20.26353717 medRxiv

Top 0.3%

0.8%

Background: With growing impetus to integrate artificial intelligence (AI) tools into radiology, clinical practices must navigate workflow redesign. This carries implications for medical malpractice liability. Methods: We conducted an online vignette experiment with United States adults who acted as hypothetical jurors in a malpractice case involving a missed intracranial hemorrhage. Participants (n=2,347) were randomized to one of 22 conditions: a no-AI control and 21 conditions involving a hypothetical AI system. These 21 conditions varied by whether (1) a single-read or double-read workflow was used, (2) the radiologist's initial interpretation was documented, (3) the radiologist changed their interpretation after viewing AI output, (4) the AI detected the abnormality, and (5) the AI error rate, either the False Discovery Rate (FDR) or the False Omission Rate (FOR), was provided to participants only, to both participants and radiologist, or to neither. The primary outcome was perceived liability, assessed by whether the radiologist met their duty of care. Findings: Perceived liability differed across conditions (p<0.0001). Double-read workflows (p<0.0001), documenting initial interpretations (p=0.0125), and providing participants with AI error rates, including the FDR (p=0.0038) or FOR (p=0.0035), reduced perceived liability. Liability was also lower when AI was incorrect (p<0.0001). Radiologists' awareness of AI error rates did not significantly impact liability. Notably, we observed an erroneous-change penalty: the greatest liability occurred when radiologists initially identified an abnormality but later changed their interpretation to normal after seeing that AI identified the case as normal; conversely, perceived liability was lowest with documented, double-read workflows. Interpretation: Double-read workflows with documented initial interpretations and disclosure of AI error rates reduce perceived liability, though changing a correct initial interpretation increases it.
Strategic workflow design is critical for successful AI implementation that can mitigate malpractice risk.
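
For readers less familiar with the two error rates named in this abstract, the sketch below shows their standard definitions from a confusion matrix: the FDR is the share of positive AI calls that are wrong, and the FOR is the share of negative AI calls that are wrong. The counts are hypothetical and are not taken from the study, which does not report them here.

```python
def false_discovery_rate(true_positives, false_positives):
    """FDR = FP / (FP + TP): fraction of flagged cases that are actually normal."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no positive calls, FDR is undefined")
    return false_positives / flagged


def false_omission_rate(true_negatives, false_negatives):
    """FOR = FN / (FN + TN): fraction of cleared cases that are actually abnormal."""
    cleared = true_negatives + false_negatives
    if cleared == 0:
        raise ValueError("no negative calls, FOR is undefined")
    return false_negatives / cleared


# Hypothetical counts for an AI hemorrhage detector (illustration only).
tp, fp, tn, fn = 90, 30, 870, 10

fdr = false_discovery_rate(tp, fp)
f_or = false_omission_rate(tn, fn)
print(f"FDR = {fdr:.3f}, FOR = {f_or:.4f}")
```

In the missed-hemorrhage scenario, the FOR is the rate relevant to a case the AI marked as normal, while the FDR applies to a case the AI flagged.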

18

Revisiting the Structure of the Ventricular Myocardium in Tetralogy of Fallot Using Hierarchical Phase Contrast Tomography and Structure Tensor Analysis

Sabarigirivasan, V.; Brunet, J.; Dejea, H.; Crucean, A.; Jegatheeswaran, A.; Bosi, G.; Urban, T.; Chestnutt, L.; Purzycka, J.; Tafforeau, P.; Friedberg, M. K.; Lee, P. D.; Cook, A. C.

2026-05-04 physiology 10.64898/2026.04.29.721688 medRxiv

Top 0.3%

0.8%

BACKGROUND: In tetralogy of Fallot (ToF), changes to right ventricular (RV) function (as seen by strain or TAPSE) relate to altered myocardial structure. Direct three-dimensional anatomical evidence supporting these changes remains limited. OBJECTIVES: To non-destructively characterize myocardial architecture in pediatric ToF hearts using Hierarchical Phase-Contrast Tomography (HiP-CT) and structure tensor analysis. METHODS: Twenty ToF and control pediatric hearts were imaged at the European Synchrotron, ESRF. Myocyte orientation was assessed through structure tensor analysis and distributed high-performance computing. A region-specific framework was developed for analysis of the RV. The predominant direction of myocardial aggregates (their helical angle) was compared across ventricular regions. RESULTS: Significant differences in orientation were found in all ToF segments vs controls (left ventricle, RV inlet, RV outflow tract, septum; p < 0.001). Myocytes in the ToF RV inlet were more circumferential overall, with regional heterogeneity. Contrary to traditional models, no discrete middle layer was found in the ToF RV; instead, a shift towards more circumferentially oriented myocytes and disrupted septal and outflow components was observed. RV contribution to the septum was greater in ToF (47.3% vs 34.0%; p = 0.0026), with extension of ventricular insertion points disrupting septal architecture. There were more longitudinally oriented myocytes in the ToF RVOT, consistent with hypertrophied septo-parietal trabeculations. LV structure in ToF demonstrated a greater proportion of circumferentially oriented myocytes vs controls. CONCLUSIONS: We reveal profound alterations in ToF myocardial organization which broadly align with clinical observations and provide the first open-access HiP-CT congenital heart disease data as a basis for future computational modelling.

19

Assessing Foundation Models for Computational Pathology in Endometrial Cancer

Volinsky-Fremond, S.; van den Berg, N.; Barkey Wolf, J.; Schoenpflug, L. A.; Andani, S.; Ortoft, G.; Jobsen, J. J.; Lutgens, L. C.; Powell, M. E.; Mileshkin, L. R.; Mackay, H.; Leary, A.; Razack, R. R.; de Bruyn, M.; de Boer, S. M.; Nout, R. A.; Smit, V. T.; Creutzberg, C. L.; Koelzer, V. H.; Bosse, T.; Horeweg, N.

2026-05-25 pathology 10.64898/2026.05.22.26353897 medRxiv

Top 0.3%

0.8%

Computational pathology leverages deep learning to extract clinically relevant information from digitized tumor slides, predicting histopathological subtypes, molecular alterations, and patient outcomes. Recent pipelines increasingly rely on foundation models trained on large pan-cancer datasets to generate generalizable features. In endometrial cancer (EC), their comparative performance for clinical diagnostic tasks remains unexplored. For the first time, this study evaluates the performance of seven state-of-the-art foundation models across morphological, molecular, and prognostic tasks using a large EC dataset of 3,293 patients from randomized trials and clinical cohorts. In addition, their performance was compared to one model (EsVIT) trained exclusively on EC. The foundation models H-OPTIMUS-0, CONCH, and VIRCHOW2 achieved the highest mean performance, but the best-performing foundation model varied by task. The top-performing foundation model outperformed the EC-specific feature extractor EsVIT across all tasks. This study highlights the superiority of foundation models over a domain-specific feature extractor in EC. Selecting the optimal foundation model for novel tasks remains challenging due to performance plateaus and limited information on the training datasets, requiring rigorous benchmarking and domain insight to reach maximum potential.

20

Consensus-based technical recommendations for clinical translation of renal Dynamic Contrast-Enhanced (DCE) MRI

Gunwhy, E. R.; Kurugol, S.; Serai, S.; van der Molen, A. J.; Abou El-Ghar, M.; Buckley, D. L.; Hockings, P. D.; Jones, R. A.; Lim, R. P.; Mendichovszky, I. A.; Pedersen, M.; Reynolds, H. M.; Sanmiguel Serpa, L. C.; Wentland, A.; Zoellner, F. G.; Sourbron, S.; Dekkers, I. A.

2026-05-14 radiology and imaging 10.64898/2026.05.11.26352525 medRxiv

Top 0.4%

0.7%

Background: Dynamic contrast-enhanced (DCE) MRI has the potential to be a useful tool for non-invasively assessing renal haemodynamics and function; however, insufficient standardisation and difficulties in post-processing remain barriers to clinical translation. Purpose: To develop expert consensus-based technical recommendations for performing renal DCE-MRI in humans, relating to aspects of patient preparation, MRI hardware and acquisition parameters, and data analysis. Study Type: Systematic consensus process using an approximation to the two-step modified Delphi method. Population: Not applicable. Field Strength / Sequence: 1.5 T and 3 T / Renal gradient echo-based 3D DCE-MRI. Assessment: An international panel of experts was recruited and surveyed following a modified Delphi method to create consensus-based technical recommendations. Key areas for consensus were initially identified through a mixture of online and in-person discussions, and an initial survey round consisting of open- and close-ended questions. Consensus statements were formulated and iteratively refined to create the final recommendations. Statistical Tests: Consensus was defined as ≥75% agreement in response (excluding abstentions), and clear preference was defined as 60-74% agreement among the experts. Statements with ≥40% abstentions were either excluded from subsequent survey rounds or recirculated as a modified statement. Results: 22 experts initially participated in the Delphi panel, of whom 16 responded to the first survey. 15 panellists responded to all subsequent surveys. Out of 46 statements, 37 reached consensus and one showed clear preference. Abstention of ≥40% was found in seven statements, which were excluded from the final set of recommendations. Data conclusion: These recommendations provide a starting point for MRI centres worldwide wishing to perform renal DCE-MRI, contributing to the harmonisation of DCE-MRI scan protocols and facilitating clinical translation.
These recommendations provide a practical minimum technical dataset for renal DCE-MRI acquisition and analysis to improve cross-site comparability and support responsible clinical translation.
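
The thresholds stated under Statistical Tests can be expressed as a small decision rule. The sketch below is one possible reading, not the authors' procedure: the function name `classify_statement`, the reduction of responses to agree/disagree/abstain counts, and the treatment of agreement between 74% and 75% as clear preference are assumptions. Integer arithmetic is used so that boundary cases such as exactly 40% abstention are not affected by floating-point rounding.

```python
def classify_statement(agree, disagree, abstain):
    """Classify one Delphi statement from response counts.

    Hypothetical helper; assumes each panellist agrees, disagrees, or abstains.
    """
    total = agree + disagree + abstain
    if total == 0:
        raise ValueError("no responses recorded")

    # Abstentions of 40% or more: statement is excluded or recirculated.
    if 100 * abstain >= 40 * total:
        return "excluded or recirculated"

    # Agreement is computed among non-abstaining panellists only.
    voters = agree + disagree
    if 100 * agree >= 75 * voters:
        return "consensus"
    if 100 * agree >= 60 * voters:
        return "clear preference"
    return "no consensus"


# Hypothetical response counts for a 15-member panel.
examples = {
    "statement A": (12, 3, 0),
    "statement B": (9, 4, 2),
    "statement C": (5, 4, 6),
    "statement D": (6, 7, 2),
}

results = {name: classify_statement(*counts) for name, counts in examples.items()}
for name, outcome in results.items():
    print(name, "->", outcome)
```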